Visualization is a powerful tool for data exploration, but input data are often high-dimensional, which makes them hard to visualize directly. Dimensionality reduction is a family of methods that reduce the number of dimensions of the data, for visualization and for other purposes.
Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable (hard to control or deal with). Dimensionality reduction is common in fields that deal with large numbers of observations and/or large numbers of variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics.
(Source: Wikipedia, "Dimensionality reduction".)
import pandas as pd
import numpy as np
import plotly.express as px
# Initialize Plotly's offline mode so interactive figures are embedded in the notebook (and in HTML exports)
import plotly
plotly.offline.init_notebook_mode()
For more details about this dictionary, please see the project Sentiment analysis with Naive Bayes vs. LSTM Keras model
file=r'G:\Mon Drive\Personnel\05_Python_html_ext_code\08_AI_&_data science\Sentiment_analysis\Glove.npz'
loaded = np.load(file,allow_pickle=True)
Glove=loaded['Glove'].tolist()
words=['car','bus','train', 'woman','man','child','france','italy','germany']
category=['transport','transport','transport','human','human','human','country','country','country']
len(words),len(category)
(9, 9)
X=[]
for w in words:
X.append(Glove[w].tolist())
X=np.array(X)
X.shape
(9, 50)
Xn=(X-X.mean(axis=0))/X.std(axis=0)
Xn.shape
(9, 50)
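As a quick sanity check, standardization should leave every column with zero mean and unit standard deviation. A minimal sketch on synthetic data (a random stand-in for the GloVe matrix loaded above):

```python
import numpy as np

# Synthetic stand-in for the (9, 50) matrix of GloVe vectors used above
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(9, 50))

# Column-wise standardization, exactly as done above
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# Every column now has (numerically) zero mean and unit standard deviation
print(np.allclose(Xn.mean(axis=0), 0.0))  # True
print(np.allclose(Xn.std(axis=0), 1.0))   # True
```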
*The input data has 50 columns, so it is hard to plot them all. The solution is to use the PCA algorithm to reduce the dimension from 50 down to 2.*
COV=np.cov(Xn, rowvar=False)
COV.shape
(50, 50)
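Under the hood, `np.cov` with `rowvar=False` computes the sample covariance of the columns. A sketch on synthetic data showing the equivalent explicit formula (the variable names here are illustrative, not from the notebook):

```python
import numpy as np

# Synthetic stand-in for the standardized (9, 50) matrix above
rng = np.random.default_rng(1)
Xn = rng.normal(size=(9, 50))

# np.cov(Xn, rowvar=False) equals Xc.T @ Xc / (m - 1) on the centered data
m = Xn.shape[0]
Xc = Xn - Xn.mean(axis=0)
COV_manual = Xc.T.dot(Xc) / (m - 1)

print(np.allclose(COV_manual, np.cov(Xn, rowvar=False)))  # True
```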
For more information about eigenvalues/eigenvectors, please see the project Eigenvalues / EigenVectors, Covariance, Correlation
EigenVals, EigenVecs = np.linalg.eigh(COV)
EigenVecs.shape
(50, 50)
EigenVals
array([-2.88818316e-15, -2.53898779e-15, -2.38941499e-15, -2.27499066e-15,
-2.23097527e-15, -1.90785440e-15, -1.74317347e-15, -1.72006427e-15,
-1.21593304e-15, -1.17649979e-15, -9.99778220e-16, -8.60631073e-16,
-7.56937735e-16, -6.74129019e-16, -5.63288864e-16, -4.45314686e-16,
-4.19832605e-16, -3.93899035e-16, -3.22564427e-16, -1.87234899e-16,
3.61433213e-18, 5.69336912e-17, 1.52957220e-16, 2.90611120e-16,
4.07079914e-16, 4.32576628e-16, 4.96066786e-16, 5.53562087e-16,
7.22395715e-16, 7.33001616e-16, 8.55283910e-16, 9.73504182e-16,
1.19645835e-15, 1.32362356e-15, 1.70395447e-15, 1.74594450e-15,
2.14249559e-15, 2.22708761e-15, 2.61559632e-15, 2.89292472e-15,
3.36249024e-15, 4.82687959e-15, 1.10442638e+00, 2.05198275e+00,
3.37678465e+00, 3.63730471e+00, 5.36501108e+00, 6.01310240e+00,
1.42403286e+01, 2.04610594e+01])
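Note that only 8 of the 50 eigenvalues are clearly nonzero. This is expected: with only m = 9 samples, the centered data matrix has rank at most m - 1 = 8, so the covariance matrix has at most 8 nonzero eigenvalues; the tiny positive and negative values above are floating-point noise around zero. A sketch on synthetic data of the same shape:

```python
import numpy as np

# Synthetic (9, 50) data: 9 samples in 50 dimensions, as above
rng = np.random.default_rng(2)
Xn = rng.normal(size=(9, 50))

# np.cov centers the data internally, so the covariance has rank at most m - 1 = 8
COV = np.cov(Xn, rowvar=False)
print(np.linalg.matrix_rank(COV))  # 8
```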
# Sort the eigenValues: Descending
index=np.argsort(EigenVals)[::-1]
index
array([49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33,
32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
dtype=int64)
# Reorder the eigenvalues using the sorted index
EigenVals=EigenVals[index]
EigenVals
array([ 2.04610594e+01, 1.42403286e+01, 6.01310240e+00, 5.36501108e+00,
3.63730471e+00, 3.37678465e+00, 2.05198275e+00, 1.10442638e+00,
4.82687959e-15, 3.36249024e-15, 2.89292472e-15, 2.61559632e-15,
2.22708761e-15, 2.14249559e-15, 1.74594450e-15, 1.70395447e-15,
1.32362356e-15, 1.19645835e-15, 9.73504182e-16, 8.55283910e-16,
7.33001616e-16, 7.22395715e-16, 5.53562087e-16, 4.96066786e-16,
4.32576628e-16, 4.07079914e-16, 2.90611120e-16, 1.52957220e-16,
5.69336912e-17, 3.61433213e-18, -1.87234899e-16, -3.22564427e-16,
-3.93899035e-16, -4.19832605e-16, -4.45314686e-16, -5.63288864e-16,
-6.74129019e-16, -7.56937735e-16, -8.60631073e-16, -9.99778220e-16,
-1.17649979e-15, -1.21593304e-15, -1.72006427e-15, -1.74317347e-15,
-1.90785440e-15, -2.23097527e-15, -2.27499066e-15, -2.38941499e-15,
-2.53898779e-15, -2.88818316e-15])
# Reorder the eigenvectors with the same index
EigenVecs=EigenVecs[:,index]
This matrix will not be used in the current project
S=np.diag(EigenVals)
S.shape
(50, 50)
Out_dim=2
Sub_EigenVecs=EigenVecs[:,:Out_dim]
Sub_EigenVecs.shape
(50, 2)
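A standard diagnostic (not computed in the original notebook) is the explained-variance ratio: the fraction of the total variance captured by the retained components. Using the eight nonzero eigenvalues printed above:

```python
import numpy as np

# The eight nonzero sorted eigenvalues printed above (the rest are numerical zeros)
EigenVals = np.array([20.4610594, 14.2403286, 6.01310240, 5.36501108,
                      3.63730471, 3.37678465, 2.05198275, 1.10442638])

# Fraction of the total variance captured by the first two components
ratio = EigenVals[:2].sum() / EigenVals.sum()
print(round(ratio, 3))  # 0.617
```

So the two retained components keep roughly 62% of the variance of the standardized data.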
Reminder of the input shape
X.T.shape
(50, 9)
Xr=Sub_EigenVecs.T.dot(X.T).T
Xr.shape
(9, 2)
Reminder of the input shape
X.shape
(9, 50)
df=pd.DataFrame(Xr,columns=['x1','x2'])
df['word']=words
df['category']=category
df
|   | x1 | x2 | word | category |
|---|---|---|---|---|
| 0 | 0.823152 | 2.500449 | car | transport |
| 1 | 0.960812 | 3.640771 | bus | transport |
| 2 | 0.436880 | 2.558087 | train | transport |
| 3 | 2.452261 | -1.506051 | woman | human |
| 4 | 2.094862 | -1.170678 | man | human |
| 5 | 2.403375 | -1.911725 | child | human |
| 6 | -3.407146 | -1.153402 | france | country |
| 7 | -3.175158 | -0.729738 | italy | country |
| 8 | -3.332977 | -1.022581 | germany | country |
fig=px.scatter(df,x='x1',y='x2',color='category',text='word',width=800, height=600,
title='The transformed data')
fig.show()
We can see that each category is grouped in a specific area of the figure.
def PCA(X,Out_dim=2,std_norm=False):
    # X.shape: (m, n); m is the number of rows in the data, n the number of columns (features)
    # Out_dim: the desired output dimension; 2 is convenient for visualization
    # std_norm: if True, also divide by the standard deviation when normalizing X
    # Normalization
    if std_norm:
        Xn=(X-X.mean(axis=0))/X.std(axis=0)
    else:
        Xn=X-X.mean(axis=0)
    # Covariance
    COV=np.cov(Xn, rowvar=False)
    #======== Eigendecomposition of the covariance matrix =======
    EigenVals, EigenVecs = np.linalg.eigh(COV)
    # Sort the eigenvalues: descending
    index=np.argsort(EigenVals)[::-1]
    # Use the index to reorder the eigenvectors the same way
    EigenVecs=EigenVecs[:,index]
    # Keep only the first Out_dim eigenvectors
    U=EigenVecs[:,:Out_dim]
    #============================================================
    # Compute the reduced X
    Xr=U.T.dot(X.T).T
    return Xr
Xr2=PCA(X,Out_dim=2)
df2=pd.DataFrame(Xr2,columns=['x1','x2'])
df2['word']=words
df2['category']=category
df2
|   | x1 | x2 | word | category |
|---|---|---|---|---|
| 0 | 1.833871 | 2.239016 | car | transport |
| 1 | 2.182165 | 3.049211 | bus | transport |
| 2 | 1.314006 | 2.543402 | train | transport |
| 3 | 1.893575 | -2.657787 | woman | human |
| 4 | 1.778322 | -2.000878 | man | human |
| 5 | 1.787517 | -2.536059 | child | human |
| 6 | -3.665965 | 0.056487 | france | country |
| 7 | -3.298916 | 0.007370 | italy | country |
| 8 | -3.535607 | 0.202911 | germany | country |
fig=px.scatter(df2,x='x1',y='x2',color='category',text='word',width=800, height=600,
title='The transformed data with the PCA function')
fig.show()
from sklearn.decomposition import PCA as SklearnPCA
For more information, see the scikit-learn documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
pca = SklearnPCA(n_components=2)
pca.fit(X)
PCA(n_components=2)
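The `fit`/`transform` pair can also be collapsed into a single `fit_transform` call. A minimal sketch on synthetic data (the real `X` above comes from the GloVe file):

```python
import numpy as np
from sklearn.decomposition import PCA as SklearnPCA

# Synthetic stand-in for the (9, 50) GloVe matrix used above
rng = np.random.default_rng(3)
X = rng.normal(size=(9, 50))

pca = SklearnPCA(n_components=2)
Xr = pca.fit_transform(X)  # equivalent to pca.fit(X) followed by pca.transform(X)
print(Xr.shape)            # (9, 2)
```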
Xr_sklearn=pca.transform(X)
dfs=pd.DataFrame(Xr_sklearn,columns=['x1','x2'])
dfs['word']=words
dfs['category']=category
dfs
|   | x1 | x2 | word | category |
|---|---|---|---|---|
| 0 | -1.801764 | 2.138607 | car | transport |
| 1 | -2.150057 | 2.948803 | bus | transport |
| 2 | -1.281898 | 2.442994 | train | transport |
| 3 | -1.861468 | -2.758195 | woman | human |
| 4 | -1.746214 | -2.101286 | man | human |
| 5 | -1.755409 | -2.636467 | child | human |
| 6 | 3.698072 | -0.043921 | france | country |
| 7 | 3.331024 | -0.093039 | italy | country |
| 8 | 3.567715 | 0.102503 | germany | country |
fig=px.scatter(dfs,x='x1',y='x2',color='category',text='word',width=800, height=600,
title='Sklearn transformed data')
fig.show()
# Flip the sign of x1: eigenvector signs are arbitrary, so an axis may come out mirrored
dfs.x1*=-1
fig=px.scatter(dfs,x='x1',y='x2',color='category',text='word',width=800, height=600)
fig.show()
After flipping the sign of x1, the local PCA function and the Sklearn PCA function give essentially the same result.
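The sign flip is expected: an eigenvector is only determined up to its sign, since if A v = λ v then A (-v) = λ (-v), so different implementations may return mirrored axes. A small sketch (the 2x2 matrix here is just an illustrative example):

```python
import numpy as np

# If A v = lam * v, then A (-v) = lam * (-v): eigenvector signs are arbitrary.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eigh(A)
v, lam = vecs[:, 0], vals[0]

print(np.allclose(A.dot(v), lam * v))    # True
print(np.allclose(A.dot(-v), lam * -v))  # True
```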
Xr3=PCA(X,Out_dim=3)
Xr3.shape
(9, 3)
df3=pd.DataFrame(Xr3,columns=['x1','x2','x3'])
df3['word']=words
df3['category']=category
df3
|   | x1 | x2 | x3 | word | category |
|---|---|---|---|---|---|
| 0 | 1.833871 | 2.239016 | 1.986809 | car | transport |
| 1 | 2.182165 | 3.049211 | 0.101166 | bus | transport |
| 2 | 1.314006 | 2.543402 | -0.456373 | train | transport |
| 3 | 1.893575 | -2.657787 | 0.990900 | woman | human |
| 4 | 1.778322 | -2.000878 | 2.066705 | man | human |
| 5 | 1.787517 | -2.536059 | -1.404173 | child | human |
| 6 | -3.665965 | 0.056487 | 0.787283 | france | country |
| 7 | -3.298916 | 0.007370 | 0.628147 | italy | country |
| 8 | -3.535607 | 0.202911 | 0.335505 | germany | country |
fig=px.scatter_3d(df3, x='x1', y='x2', z='x3',color='category',
text='word',width=800, height=600,title='3D plot of the transformed data')
fig.show()
In this notebook we implemented a PCA algorithm for dimensionality reduction using only the NumPy library, and we compared the result with the Sklearn PCA transformation.